library(tidyverse)   # loads ggplot2, dplyr, readr, forcats, tibble
library(tidymodels)  # loads rsample, parsnip, yardstick, recipes
library(here)
library(readxl)
library(rlang)
library(ggfortify)
library(markdown)
library(knitr)
library(kableExtra)
library(gridExtra)
library(grid)
library(plotly)
library(foreach)
library(doParallel)
library(scatterplot3d)
library(kknn)

This document presents a data-driven approach to optimizing the methodology for estimating an energy baseline (LBEn) and evaluating energy consumption performance in wastewater treatment plants. The analysis focuses on the application of data analysis and predictive modeling techniques in R to improve the accuracy, robustness, and scalability of energy consumption estimates.
The study examines and compares different predictive approaches, including regression-based models and k-nearest neighbors, in order to identify the most appropriate method for forecasting electrical energy consumption. It also addresses key aspects such as data preprocessing, the definition of training and testing datasets, result visualization, and model performance evaluation.
In the context of industrial process optimization, this project emphasizes the advantages of transitioning from manual, spreadsheet-based methodologies to reproducible and scalable analytical solutions developed in R. This shift enables more consistent performance monitoring and supports more reliable, data-informed decision-making, while preserving the confidentiality and anonymity of the underlying operational data. All datasets used in this project are anonymized and consist of randomly generated data based on realistic parameters and representative figures from industrial operations in Chile, ensuring methodological validity without disclosing sensitive or identifiable information.
#Load Data
dataWW <- read_excel("dataWW.xlsx")
head(dataWW) # Visualize the first rows of the dataset
## # A tibble: 6 × 18
## YEAR MONTH Volm3 DO `Reactor T°` `Sludge blanket height` `Settling test`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 1 2341427 1 17 130 428
## 2 2021 2 2310871 2 20 133 425
## 3 2021 3 3066422 1 17 155 268
## 4 2021 4 2417132 2 15 167 333
## 5 2021 5 3447598 1 20 81 315
## 6 2021 6 2731841 2 15 70 326
## # ℹ 11 more variables: `Real Ton` <dbl>, `Weighted moisture` <dbl>,
## # `Tons of Dry Solids (DS)` <dbl>, MLTSS <dbl>,
## # `Chemical Oxygen Demand (CDO)` <dbl>,
## # `Biochemical Oxygen Demand (BOD)` <dbl>,
## # `Total Kjeldahl Nitrogen (TKN)` <dbl>, `BOD5 load` <dbl>, `TKN load` <dbl>,
## # `Carga DBO5 + NKT` <dbl>, kWh <dbl>
For practical purposes, the models are applied to a single treatment plant, which serves as an example and can later be replicated across the remaining plants. The production system used for demonstration purposes corresponds to the Cadam wastewater treatment plant.
To facilitate the analysis, variable names are simplified to a syntax that is compatible with the models applied in subsequent stages of the study. The target variable to be predicted is energy consumption (kWh), while the input variables considered for this specific case are treated water volume measured in cubic meters (Volm3), dissolved oxygen (DO), and mixed liquor total suspended solids (MLTSS).
For the kWh variable, the distribution is first visualized graphically by year.
kwh_year1 <- cadam2 %>% filter(YEAR == 2021)
kwh_year2 <- cadam2 %>% filter(YEAR == 2022)
kwh_year3 <- cadam2 %>% filter(YEAR == 2023)

cadam2 %>%
  ggplot(aes(x = kWh, fill = as.factor(YEAR))) +
  geom_density(alpha = 0.4) +
  geom_vline(aes(xintercept = mean(kwh_year1$kWh)),
             linetype = "dashed", color = "#4575b4", size = 1) +
  geom_vline(aes(xintercept = mean(kwh_year2$kWh)),
             linetype = "dashed", color = "#91bfdb", size = 1) +
  geom_vline(aes(xintercept = mean(kwh_year3$kWh)),
             linetype = "dashed", color = "#313695", size = 1) +
  labs(x = "Electric energy consumption (kWh)",
       y = "Density f(x)",
       title = "Annual distribution of kWh") +
  scale_fill_manual(values = c("2021" = "#4575b4", "2022" = "#91bfdb", "2023" = "#313695")) +
  theme_bw()

From the plot, it can be observed that for 2022 the distribution of the variable closely resembles a normal distribution, with the exception of outliers on the left. The dashed vertical lines represent the annual mean values. For 2021 and 2023 the data appear more dispersed.
Nevertheless, in order to allow the models to learn from a larger number of observations, the kWh variable is analyzed without annual segmentation. Consequently, it is necessary to visualize its overall density distribution.
cadam2 %>%
ggplot() +
geom_density(aes(x = kWh), size = 1, alpha = 0.3, fill = "steelblue") +
geom_vline(xintercept = mean(cadam2$kWh), linetype = "dashed", color = "darkblue", size = 1) +
labs(x = "Electric energy consumption (kWh)", y = "Density f(x)", title = "kWh distribution") +
theme_bw()
The overall density of kWh approximates a normal distribution, although not perfectly. Additionally, as shown in the statistical summary presented in Section 3.2, the mean value is close to the median.
The same exploratory analysis is then applied to the remaining two variables.
cadam2 %>%
  ggplot() +
  geom_density(aes(x = Volm3), size = 1, alpha = 0.3, fill = "steelblue") +
  geom_vline(xintercept = mean(cadam2$Volm3), linetype = "dashed", color = "darkblue", size = 1) +
  labs(x = "Volume", y = "Density f(x)", title = "Distribution of Volm3") +
  theme_bw()

cadam2 %>%
  ggplot() +
  geom_density(aes(x = DO), size = 1, alpha = 0.3, fill = "steelblue") +
  geom_vline(xintercept = mean(cadam2$DO), linetype = "dashed", color = "darkblue", size = 1) +
  labs(x = "Dissolved Oxygen", y = "Density f(x)", title = "DO distribution") +
  theme_bw()

KEY FINDINGS: As can be observed, the variables directly related to chemical treatment processes do not exhibit density distributions resembling a normal distribution. For this reason, linear regression models may not be the most appropriate approach.
To identify the variables that contribute to predicting energy consumption in kWh, those variables were selected which, according to the literature and observations from wastewater treatment processes, are expected to influence the dependent variable (kWh).
First, a random seed is set to ensure reproducibility, and the data are initially partitioned into 80% training and 20% testing sets. The resulting output illustrates how the data were split.
set.seed(438)
splits <- initial_split(data = cadam2,
                        prop = 0.8) # Data partitioning using an 80% proportion
splits # Visualize
## <Training/Testing/Total>
## <28/8/36>
For model analysis, 28 observations are used for training and 8 observations for testing. Due to the limited sample size, stratification is not applied; however, it is recommended when working with larger datasets.
Subsequently, the training and evaluation datasets are stored for later use.
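The stored splits are typically extracted with rsample's `training()` and `testing()` helpers; a minimal sketch using stand-in data (`mtcars`), since the original chunk is not shown — the document applies this to `initial_split(cadam2, prop = 0.8)`:

```r
library(rsample)

set.seed(438)
# Stand-in for the document's initial_split(cadam2, prop = 0.8)
splits <- initial_split(mtcars, prop = 0.8)

train <- training(splits)  # 80% of rows, used for model fitting
test  <- testing(splits)   # remaining rows, used for evaluation

nrow(train)
nrow(test)
```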
Next, 10 partitions are defined for the training data using 10-fold cross-validation. Stratification is attempted; however, a warning is generated due to the limited sample size. This limitation can be addressed when working with a larger volume of historical data. The output displays the defined partitions.
## Warning: The number of observations in each quantile is below the recommended threshold
## of 20.
## • Stratification will use 1 breaks instead.
## Warning: Too little data to stratify.
## • Resampling will be unstratified.
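The folds described above can be created with `rsample::vfold_cv()`; a sketch with stand-in data (`mtcars`, stratified on `mpg`), since the original chunk is not shown — the document applies this to the training split, stratifying on `kWh`:

```r
library(rsample)

set.seed(438)
# 10-fold cross-validation; with very few observations per quantile,
# vfold_cv() emits stratification warnings like those shown above and
# may fall back to fewer breaks or unstratified resampling
vfold <- vfold_cv(mtcars, v = 10, strata = mpg)
vfold
```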
In order to compare the current approach with the proposed methodology, multiple linear regression models are first implemented. The model specification is created using the linear_reg() function from the parsnip package. The output displays the linear model specification, which is subsequently used to fit and evaluate the linear regression model.
#Define the model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

Fit a linear regression model for electrical energy consumption (kWh) as a function of official treated water volume (Volm3) and dissolved oxygen (DO) using the training data.
#Fit the linear regression model
lm_fit <- lm_spec %>%
  #Functional form y = f(x) = b_0 + b_1*x_1 + b_2*x_2
  fit(`kWh` ~ `Volm3` + `DO`, data = train) # Fit on the training dataset
  # Note: step_normalize() is a recipes step and cannot be piped onto a
  # fitted parsnip model; normalization belongs in a recipe before fitting
lm_fit
## parsnip model object
##
##
## Call:
## stats::lm(formula = kWh ~ Volm3 + DO, data = data)
##
## Coefficients:
## (Intercept) Volm3 DO
## 7.418e+05 2.161e-02 -6.186e+03
Functional form:
\[\text{Electrical energy (kWh)} = \beta_0 + \beta_1 \cdot \text{Treated water volume} + \beta_2 \cdot \text{Dissolved oxygen}\]
\[\text{Electrical energy (kWh)} = 741800 + 0.02161 \cdot \text{Volm3} - 6186 \cdot \text{DO}\]
A statistical summary of the fitted linear regression model, including coefficients, standard errors, t-values, and p-values, is shown below:
##
## Call:
## stats::lm(formula = kWh ~ Volm3 + DO, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85298 -35679 5255 35832 82116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.418e+05 6.060e+04 12.242 4.68e-12 ***
## Volm3 2.161e-02 1.927e-02 1.121 0.273
## DO -6.186e+03 1.846e+04 -0.335 0.740
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47630 on 25 degrees of freedom
## Multiple R-squared: 0.05375, Adjusted R-squared: -0.02195
## F-statistic: 0.71 on 2 and 25 DF, p-value: 0.5013
Then, the model results are converted into a tibble for a cleaner and more structured visualization.
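This conversion is typically done with `broom::tidy()`; a sketch using a stand-in `lm` fit (parsnip fits such as `lm_fit` support `tidy()` the same way):

```r
library(broom)

# Stand-in for the document's lm_fit; tidy() returns one row per coefficient
fit <- lm(mpg ~ wt + hp, data = mtcars)
tidy(fit)
```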
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 741797. 60595. 12.2 4.68e-12
## 2 Volm3 0.0216 0.0193 1.12 2.73e- 1
## 3 DO -6186. 18464. -0.335 7.40e- 1
This way, the model predictions can be visualized on the training dataset.
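With parsnip, these predictions come from calling `predict()` on the fitted model, which returns a tibble with a `.pred` column; a sketch with stand-in data:

```r
library(parsnip)

# Stand-in for predict(lm_fit, new_data = train)
fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

predict(fit, new_data = mtcars)  # tibble with one .pred per row
```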
## # A tibble: 28 × 1
## .pred
## <dbl>
## 1 800670.
## 2 786639.
## 3 793946.
## 4 802974.
## 5 785218.
## 6 788454.
## 7 810032.
## 8 781654.
## 9 801870.
## 10 786205.
## # ℹ 18 more rows
Next, the fitted values (predictions) are calculated and displayed, and then joined with the original dataset, including the input variables (Volm3 and DO) and the actual value of the dependent variable (kWh).
# Obtain fitted values (predictions): estimated kWh and join with dataset
fitted_values <- dplyr::bind_cols(
predict(lm_fit, new_data = train),
train
) %>%
# Select columns of interest
select(.actual = `kWh`, # Actual value (y)
.pred, # Prediction (estimated y)
`Volm3`,`DO`) # Input variables (x)

A 3D plot is now presented, comparing the actual values with the model predictions as a function of the two input variables `Volm3` and `DO`.
scatter_3d1 <- scatterplot3d(
x = fitted_values$Volm3,
y = fitted_values$DO,
z = fitted_values$.actual,
color = "steelblue",
pch = 16,
main = "3D Linear Regression Plot 1: Actual vs. Prediction",
xlab = "Official volume (m3)",
ylab = "Dissolved oxygen",
zlab = "Electric energy (kWh)"
)
# Add predictions to the plot
scatter_3d1$points3d(
x = fitted_values$Volm3,
y = fitted_values$DO,
z = fitted_values$.pred,
col = "darkblue",
pch = 16
)

Finally, to obtain performance metrics, residuals are calculated by subtracting the model predictions from the actual values.
# Obtain residual values
fitted_values1<- fitted_values%>%
mutate(.residual = .actual - .pred) # Actual - Prediction

For a more visual representation of the residuals, a residual scatter plot is created using ggplot2. The x-axis shows the fitted values for electric energy (kWh) and the y-axis represents the residuals.
fitted_values1 %>%
ggplot2::ggplot(aes(x = .pred, y = .residual)) +
geom_point(alpha = 0.8, size = 2) +
labs(
x = "Fitted values for Electric Energy (kWh)",
y = "Residuals"
) +
geom_hline(yintercept = 0, linetype = "dashed")

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 37544.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 45004.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.0537
Interpretation
RMSE: the training model's predictions are off by 45,004 kWh on a root-mean-square basis.
MAE: the mean absolute error of the training predictions is 37,544 kWh.
RSQ: the input variables explain 5.4% of the variance.
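The metric tables above are produced with yardstick's `mae()`, `rmse()`, and `rsq()`; a sketch on a small stand-in tibble (the document applies them to `fitted_values1` with `truth = .actual`, `estimate = .pred`):

```r
library(yardstick)
library(tibble)

# Stand-in predictions: .actual = observed kWh, .pred = model estimate
fv <- tibble(.actual = c(100, 200, 300, 400),
             .pred   = c(110, 190, 320, 390))

mae(fv, truth = .actual, estimate = .pred)   # mean absolute error
rmse(fv, truth = .actual, estimate = .pred)  # root mean squared error
rsq(fv, truth = .actual, estimate = .pred)   # coefficient of determination
```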
#Fit the linear regression on the test data
lm_fit_test <- lm_spec %>%
  fit(`kWh` ~ `Volm3` + `DO`, data = test) # Fit on the test dataset
  # As above, step_normalize() is a recipes step and is not piped onto the fit
lm_fit_test
## parsnip model object
##
##
## Call:
## stats::lm(formula = kWh ~ Volm3 + DO, data = data)
##
## Coefficients:
## (Intercept) Volm3 DO
## 8.579e+05 1.534e-02 -7.331e+04
Functional form:
\[\text{Electrical energy (kWh)}= 857900 + 0.01534 \cdot \text{Volm3} - 73310 \cdot\text{DO}\]
To assess the model's performance, the linear regression specification (lm_spec) is refitted using the test data (test); an alternative would be to predict on the test set with the model fitted on the training data.
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 857886. 177314. 4.84 0.00472
## 2 Volm3 0.0153 0.0565 0.271 0.797
## 3 DO -73307. 50542. -1.45 0.207
## # A tibble: 8 × 1
## .pred
## <dbl>
## 1 746719.
## 2 837462.
## 3 827026.
## 4 760073.
## 5 815616.
## 6 829478.
## 7 759051.
## 8 749988.
#Obtain fitted values (predictions): estimated kWh and join with dataset
fitted_values_test <- dplyr::bind_cols(
predict(lm_fit_test, new_data = test),
test
) %>%
#Select columns of interest
select(.actual = `kWh`, #Actual value (y)
.pred, #Prediction (estimated)
`Volm3`,`DO`) #Input variables (x)

#Obtain residual values
fitted_values2 <- fitted_values_test %>%
mutate(.residual = .actual - .pred) #Actual - Prediction

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 47734.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 56505.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.304
Interpretation
RMSE: the evaluation model's predictions are off by 56,505 kWh on a root-mean-square basis.
MAE: the mean absolute error of the evaluation predictions is 47,734 kWh.
RSQ: the input variables explain 30.4% of the variance.
Finding: RMSE and MAE on the test set are roughly 25% higher than on the training set, while R² is markedly higher on the test set. The model is therefore not sufficiently stable.
Fit a linear regression model of kWh as a function of Volm3, DO, and MLTSS using the training data.
Functional form:
\[\text{Electrical energy (kWh)}=\beta_0+\beta_1\cdot\text{Treated water volume}+\beta_2\cdot\text{Dissolved oxygen}\\+\beta_3\cdot\text{Mixed Liquor Total Suspended Solids}\] \[\text{Electrical energy (kWh)}= 814900 + 0.01996 \cdot\text{Treated water volume} - 2588 \cdot\text{Dissolved oxygen}\\ - 31.69 \cdot\text{Mixed Liquor Total Suspended Solids}\]
#Obtain residual values
fitted_values3 <- fitted_values3 %>%
mutate(.residual = .actual - .pred) # Actual - Prediction

fitted_values3 %>%
ggplot2::ggplot(aes(x = .pred, y = .residual)) +
geom_point(alpha = 0.8, size = 2) +
labs(
x = "Fitted values for Electric Energy (kWh)",
y = "Residuals"
) +
geom_hline(yintercept = 0, linetype = "dashed")

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 34951.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 42523.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.155
Interpretation
RMSE: the training model's predictions are off by 42,523 kWh on a root-mean-square basis.
MAE: the mean absolute error of the training predictions is 34,951 kWh.
RSQ: the input variables explain 15.5% of the variance.
In training, the three-variable model performs better than the two-variable model: RMSE and MAE are lower, and R² is higher.
#Fit linear regression on test data
lm_fit_test3 <- lm_spec %>%
  #penalty/mixture apply to the glmnet engine; the printed Call confirms stats::lm
  fit(`kWh` ~ `Volm3` + `DO` + `MLTSS`, data = test) # Fit on the test dataset
lm_fit_test3
## parsnip model object
##
##
## Call:
## stats::lm(formula = kWh ~ Volm3 + DO + MLTSS, data = data)
##
## Coefficients:
## (Intercept) Volm3 DO MLTSS
## 1.141e+06 3.618e-02 -6.910e+04 -1.309e+02
Functional form:
\[\text{Electrical energy (kWh)}= 1141000 + 0.03618 \cdot\text{Treated water volume} - 69100 \cdot\text{Dissolved oxygen}\\ - 130.9 \cdot\text{Mixed Liquor Total Suspended Solids}\]
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 26133.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 30131.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.802
Interpretation
RMSE: the evaluation model's predictions are off by 30,131 kWh on a root-mean-square basis.
MAE: the mean absolute error of the evaluation predictions is 26,133 kWh.
RSQ: the input variables explain 80.2% of the variance.
Findings: Comparing training and test results for the three-variable linear regression model, performance is better on the test data. Better test than training performance is not a sign of overfitting; with only 8 test observations (and the model refitted on them), it instead points to unstable estimates from the linear regression approach.
The second methodology used is k-nearest neighbors (k-NN). This method predicts the value of a new data point based on the k closest historical points according to Euclidean distance.
Step by step:
First, specify how the model will be fitted using a recipe:
Next, define the type of model to be used (k-NN):
Split the data into ten folds for cross-validation:
Set up the workflow:
Create a grid for the number of neighbors (100 models with different k):
Perform cross-validation to find the best model:
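The six steps above can be sketched end to end with tidymodels; this uses stand-in data (`mtcars`) and a smaller grid of 20 neighbor values rather than the 100 described, since the original chunks are not shown:

```r
library(tidymodels)
library(kknn)

# 1. Recipe: preprocessing, mirroring kWh ~ Volm3 + DO on the training split
knn_recipe <- recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# 2. Model specification: k-NN with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# 3. Ten folds for cross-validation
set.seed(438)
folds <- vfold_cv(mtcars, v = 10)

# 4. Workflow combining recipe and model
knn_wf <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

# 5. Grid of candidate neighbor counts
knn_grid <- tibble(neighbors = 1:20)

# 6. Cross-validation over the grid; mean RMSE per candidate k
knn_tuned <- tune_grid(knn_wf, resamples = folds, grid = knn_grid,
                       metrics = metric_set(rmse))
knn_result <- collect_metrics(knn_tuned)
knn_result
```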
## → A | warning: A correlation computation is required, but `estimate` is constant and has 0
##                standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## There were issues with some computations   A: x11
Visualize RMSE for each neighbor count:
For a more visual result, the root mean squared error can be plotted for each number of neighbors used in the cross-validation. The lowest points on the plot indicate the best model performance, that is, the optimal number of neighbors to use.
knn_result %>%
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(
    x = "Neighbors (k)",
    y = "RMSE"
  )
Then, the lowest point on the plot is obtained using:
## # A tibble: 1 × 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 rmse standard 44855. 10 5730. pre0_mod09_post0
knn_min <- knn_result %>%
  filter(mean == min(knn_result$mean[!is.na(knn_result$mean)])) %>%
  pull(neighbors)

knn_spec_best <- nearest_neighbor(
  weight_func = "rectangular",
  neighbors = knn_min
) %>%
  set_engine("kknn") %>%
  set_mode("regression")

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 65012.
**Finding**: The RMSE on the test data (65,012 kWh) is higher than the cross-validated RMSE on the training data (44,855 kWh). Some increase on unseen data is expected; the size of the gap, however, indicates that the model generalizes only moderately well with this small sample.
To visualize the fitted prediction curve created by the model, a 3D plot representation will be used.
x_values<- seq(from =min(train$Volm3, na.rm = TRUE),
to=max(train$Volm3, na.rm = TRUE), length=32)
y_values<- seq(from =min(train$DO, na.rm = TRUE),
to=max(train$DO, na.rm = TRUE), length=32)
z_values<-knn_fit_best%>%
predict(crossing(x_values, y_values)%>%
mutate(Volm3= x_values, DO=y_values))%>%
pull(.pred)
z_matrix <- matrix(z_values, ncol = length(y_values), byrow = TRUE)
plot_3d <- plot_ly() %>%
add_markers(
data = train %>% filter(!is.na(DO)),
x = ~Volm3,
y = ~DO,
z = ~kWh,
marker = list(size = 2, opacity = 0.4, color = "red")
) %>%
add_surface(
x = ~x_values,
y = ~y_values,
z = ~matrix(z_values, nrow = length(x_values), ncol = length(y_values)),
colorbar = list(title = "Electrical energy (kWh)")
) %>%
layout(scene = list(
xaxis = list(title = "Treated water volume"),
zaxis = list(title = "Electrical energy (kWh)"),
yaxis = list(title = "Dissolved oxygen")
))
plot_3d

Just like with k-nearest neighbors, there are other methodologies for making predictions in regression problems, one of which is decision trees.
For this methodology, the same variables as before will also be used.
tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  #Since all variables are numeric, a regression model is used
  set_mode("regression") %>%
  set_engine("rpart")

tree_grilla <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 6) #Six levels were used, as the most suitable given the available computational capacity

recipe_carreras <- recipe(`kWh` ~ `Volm3` + `DO` + `MLTSS`, data = train) %>% #Define input and output variables
  step_normalize(c(`Volm3`, `DO`)) #Normalize numeric predictors, since they have different scales

doParallel::registerDoParallel() #Run the models in parallel
regression_metrics <- metric_set(rmse, mae) #Define performance metrics

set.seed(438)
tree_result <- tree_spec %>%
  tune_grid(
    preprocessor = recipe_carreras, #Preprocessing
    resamples = vfold,              #Partitions
    grid = tree_grilla,             #Decision tree grid
    metrics = regression_metrics,   #Performance metrics
    control = control_resamples(verbose = TRUE, save_pred = TRUE) #Store predictions
  )

## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 32 rmse standard 50528. 10 4196.
## 2 0.0000000001 1 40 rmse standard 50528. 10 4196.
## 3 0.0000000001 3 32 rmse standard 50528. 10 4196.
## 4 0.0000000001 3 40 rmse standard 50528. 10 4196.
## 5 0.0000000001 6 32 rmse standard 50528. 10 4196.
## # ℹ 1 more variable: .config <chr>
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 32 mae standard 44248. 10 4187.
## 2 0.0000000001 1 40 mae standard 44248. 10 4187.
## 3 0.0000000001 3 32 mae standard 44248. 10 4187.
## 4 0.0000000001 3 40 mae standard 44248. 10 4187.
## 5 0.0000000001 6 32 mae standard 44248. 10 4187.
## # ℹ 1 more variable: .config <chr>
Interpretation
RMSE: the best configuration is off by 50,528 kWh on a root-mean-square basis when predicting energy consumption on the training folds.
MAE: the best configuration's mean absolute error on the training folds is 44,248 kWh.
#Fit decision tree on training data
tree_train <- tree_spec %>%
  finalize_model(select_best(tree_result, metric = "rmse")) %>% #Select by RMSE
  fit(`kWh` ~ `Volm3` + `DO`, data = train)
## Warning: ! 32 samples were requested but there were 28 rows in the data.
## ℹ 28 samples will be used.

#Fit decision tree on evaluation data
tree_test <- tree_spec %>%
  finalize_model(select_best(tree_result, metric = "rmse")) %>% #Select by RMSE
  fit(`kWh` ~ `Volm3` + `DO` + `MLTSS`, data = test)
## Warning: ! 32 samples were requested but there were 8 rows in the data.
## ℹ 8 samples will be used.
#Prediction on training data
pred_tree_train <- tree_train %>%
  predict(train) %>%
  bind_cols(train)

#Prediction on evaluation data
pred_tree_test <- tree_test %>%
  predict(test) %>%
  bind_cols(test)

#Decision tree performance on training data
pred_tree_train %>%
  regression_metrics(truth = `kWh`, estimate = .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 44163.
## 2 mae standard 37233.
#Decision tree performance on evaluation data
pred_tree_test %>%
  regression_metrics(truth = `kWh`, estimate = .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 36246.
## 2 mae standard 29944.
Findings: The model performed better on the test set than during cross-validation, as indicated by the lower RMSE and MAE values in the evaluation phase. This likely reflects natural differences between the resampling folds and the very small test sample rather than a systematic effect. The differences are moderate, so the model remains a reasonable option for making predictions.
Findings:
Cost-Complexity Parameter (α):
Tree Depth:
Minimal Node Size (min_n):
Summary: